- 
                Notifications
    You must be signed in to change notification settings 
- Fork 15k
[MLIR] Modify lowering of gpu.alloc op to llvm #69969
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| @grypp tagging for review | 
| @llvm/pr-subscribers-mlir-gpu @llvm/pr-subscribers-mlir Author: Nishant Patel (nbpatel) ChangesIf gpu.alloc has no asyn deependency ( this is true if gpu.alloc has hostShared allocation), create a new stream. This PR is follow up to #66401 Full diff: https://github.com/llvm/llvm-project/pull/69969.diff 1 Files Affected: 
 diff --git a/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp b/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
index 097caf23edfa5dd..da1c468ed1dfd71 100644
--- a/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
+++ b/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
@@ -836,7 +836,11 @@ LogicalResult ConvertAllocOpToGpuRuntimeCallPattern::matchAndRewrite(
   // Allocate the underlying buffer and store a pointer to it in the MemRef
   // descriptor.
   Type elementPtrType = this->getElementPtrType(memRefType);
-  auto stream = adaptor.getAsyncDependencies().front();
+
+  Value stream =
+      adaptor.getAsyncDependencies().empty()
+          ? streamCreateCallBuilder.create(loc, rewriter, {}).getResult()
+          : adaptor.getAsyncDependencies().front();
 
   auto isHostShared = rewriter.create<mlir::LLVM::ConstantOp>(
       loc, llvmInt8Type, rewriter.getI8IntegerAttr(isShared));
@@ -855,7 +859,17 @@ LogicalResult ConvertAllocOpToGpuRuntimeCallPattern::matchAndRewrite(
   auto memRefDescriptor = this->createMemRefDescriptor(
       loc, memRefType, allocatedPtr, alignedPtr, shape, strides, rewriter);
 
-  rewriter.replaceOp(allocOp, {memRefDescriptor, stream});
+  if (allocOp.getAsyncToken()) {
+    // Async alloc: make dependent ops use the same stream.
+    rewriter.replaceOp(allocOp, {memRefDescriptor, stream});
+  } else {
+    // Synchronize with host and destroy stream. This must be the stream created
+    // above (with no other uses) because we check that the synchronous version
+    // does not have any async dependencies.
+    streamSynchronizeCallBuilder.create(loc, rewriter, {stream});
+    streamDestroyCallBuilder.create(loc, rewriter, {stream});
+    rewriter.replaceOp(allocOp, {memRefDescriptor});
+  }
 
   return success();
 }
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned about this PR. It creates and syncs a stream, but it's never used in our runtimes, including the new sycl runtime.
We should avoid creating and syncing a stream when it's not necessary, as it has overhead.
| auto stream = adaptor.getAsyncDependencies().front(); | ||
|  | ||
| Value stream = | ||
| adaptor.getAsyncDependencies().empty() | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit unclear of creating stream. If the stream isn't available, my understanding is that we would prefer synchronous execution. In such a scenario, we wouldn't need the stream. At least in CUDA model, you don't have to do that, the cuda driver/runtime can use existing default stream implicitly.
Can we pass nullptr for stream when it's not available? Alternatively, we might create another API in the runtimes.
Afaik, creating and synchronizing stream is not free. We better avoid if it is not necessary.
For example, sycl runtime (below) that doesn't utilize the 'stream' (aka sycl::queue *queue).
static void *allocDeviceMemory(sycl::queue *queue, size_t size, bool isShared) {
  void *memPtr = nullptr;
  if (isShared) {
    memPtr = sycl::aligned_alloc_shared(64, size, getDefaultDevice(), getDefaultContext());
  } else {
    memPtr = sycl::aligned_alloc_device(64, size, getDefaultDevice(), getDefaultContext());
  }
  if (memPtr == nullptr) {
    throw std::runtime_error("mem allocation failed!");
  }
  return memPtr;
}
mgpuMemAlloc(uint64_t size, sycl::queue *queue, bool isShared) {
  return catchAll([&]() {
    return allocDeviceMemory(queue, static_cast<size_t>(size), true);
  });
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I agree. I made the change. I added the stream in the first place is because all the runtime signatures for mgpuMemAlloc expects a stream (albeit it doesn't use it now) ...but in future if they do require the stream passed for allocation ...but I agree for now we can pass nullptr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I agree. I made the change.
Thanks for doing that! It looks better now. We just need a test for this PR as it’s regular llvm code submission policy.
I added the stream in the first place is because all the runtime signatures for mgpuMemAlloc expects a stream (albeit it doesn't use it now) ...but in future if they do require the stream passed for allocation ...but I agree for now we can pass nullptr
Would you like to separate runtime APIs into mgpuMemAlloc and mgpuMemAllocAsync? The first one doesn't need a stream in the function declaration, while the second one works asynchronously and requires a stream. This is an additional work if you are interested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the test. Yes I am interested to do it but in a separate PR once all the PR's for #65539 are merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks looks good
| @grypp can you help me merge this? | 
If gpu.alloc has no asyn deependency ( in case if gpu.alloc has hostShared allocation), create a new stream & synchronize. This PR is follow up to llvm#66401
If gpu.alloc has no asyn deependency ( in case if gpu.alloc has hostShared allocation), create a new stream & synchronize. This PR is follow up to #66401